Our group chose to analyze data from the Boston Marathon in 2015-2017. Most of our work focuses on 2017, but we conducted analysis across all three years of data. The total size of the dataset is about 26,000 rows with 25 fields for each year. Fields include demographic information (age, gender, hometown and/or country) for each runner, the finishing time for each runner, and periodic times runners reached certain milestones along the course.
For a more detailed overview of our data, please see section III.
Our group wanted to determine what makes a runner fast, based on the data we have. How much, if at all, do gender, age, past runnings of Boston, nationality, or home state affect a runner’s final finishing time? To answer this, our team broke the question into the following component pieces.
Analysis of gender (Kay)
Analysis of age (Hemanth)
Analysis of past races (Akhil)
Analysis of nationality (Ruijin)
Analysis of home state (Patrick)
It looks like several fields are either unnecessary or blank throughout the dataset. Let’s see what each column is doing:
X: A unique id for each observation
Bib: The bib for the runner. This is the number the runner wears on race day. There usually is some structure here such as “elite runners” getting low bib numbers. We can leave this as a factor because there are some bib numbers that are both letters and numbers. (See note at bottom on bib numbers.)
Name: Runner’s name. There are some repeat names here.
Age: Runner’s age. Makes sense as an int.
M.F: Male or female. Makes sense as factor.
City: City as submitted by the runner.
State: State as submitted by the runner. There are 69 total unique values for states, and 3595 are blank.
Country: Country as submitted by the runner. There are 91 total unique values with no blanks. 20,945 runners are from USA, the largest country represented.
Citizen: Citizenship as submitted by the runner.
X.1: Unclear. Fewer than a hundred entries are populated, and they appear to be states. Recommend dropping.
X5K - X40K: Times at different points along the race course. Of note, “Half” is the halfway point at 13.1 miles, or about 21.1 km.
Pace: The average minutes:seconds needed to complete one mile, equal to total time divided by total miles.
Proj Time: This value is ‘1’ for all runners. Recommend dropping.
Official.Time: The final time for the runner.
Overall: Rank amongst all finishers.
Gender: Rank in gender (2 categories).
Division: Rank in age/gender division (20 categories).
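As a worked example of the Pace field described above, the sketch below divides a finishing time by the course length and formats the result as minutes:seconds. The function name `pace_per_mile` and the 26.2-mile course length are assumptions made for illustration; the dataset's own Pace column may be rounded differently.

```r
# Pace per mile from a finishing time in minutes.
# Assumption: the course length is taken as 26.2 miles.
pace_per_mile <- function(total_minutes, miles = 26.2) {
  # Work in whole seconds per mile to avoid fractional-second output.
  secs <- as.integer(round(total_minutes * 60 / miles))
  sprintf("%d:%02d", secs %/% 60L, secs %% 60L)
}

pace_per_mile(229)  # pace for a 229-minute finish
```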
Note on Bib Numbers
Bib numbers are given out in ranges based on the runner’s fastest qualifying time. Below is the info for 2017 (http://registration.baa.org/2017/cf/Public/iframe_EntryLists.cfm):
Bib numbers are color coded. Red bibs (numbers 101 to 7,700) are assigned to Wave 1 (10:00 a.m.). White bibs (numbers 8,000 to 15,600) are assigned to Wave 2 (10:25 a.m.). Blue bibs (numbers 16,000 to 23,600) are assigned to Wave 3 (10:50 a.m.). Yellow bibs (numbers 24,000 to 32,500) are assigned to Wave 4 (11:15 a.m.). The qualifying-time break between Wave 1 (10:00 a.m. start time) and Wave 2 (10:25 a.m. start time) is 3:10:43. The break between Wave 2 and Wave 3 (10:50 a.m. start time) is 3:29:27. The break between Wave 3 and Wave 4 (11:15 a.m. start time) is 3:57:18.
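The bib ranges above can be turned into a wave number with a small lookup. This is a sketch only: `wave_for_bib` is a hypothetical helper, and it assumes that bibs containing letters, or falling in the gaps between ranges, should simply map to `NA`.

```r
# Map a bib (stored as a factor or character) to its 2017 start wave.
# Assumption: non-numeric bibs and bibs outside the listed ranges get NA.
wave_for_bib <- function(bib) {
  n <- suppressWarnings(as.integer(as.character(bib)))
  lower <- c(101, 8000, 16000, 24000)
  upper <- c(7700, 15600, 23600, 32500)
  idx <- findInterval(n, lower)  # 0 when below the first range, NA when n is NA
  in_range <- !is.na(n) & idx > 0 & n <= upper[pmax(idx, 1)]
  ifelse(in_range, idx, NA_integer_)
}

wave_for_bib(c("101", "7800", "16000", "F12", "32500"))
```

Converting through `as.character` first matters because Bib is kept as a factor; calling `as.integer` on the factor directly would return level codes rather than bib numbers.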
Do men and women perform differently? Are they different in age? Are these differences statistically significant?
Is there a relationship between performance at the half (minutes taken to reach the halfway point) and official time (minutes taken to finish the race)?
Does your performance in each quarter of the race (rank within each quarter, based only on the time taken to run that quarter) affect your final rank?
There are 11972 female and 14438 male participants in the data frame.
Women’s mean age (39.9 years) is significantly different from men’s mean age (44.8 years) (p<0.001).
Women’s mean official time (249 minutes) is significantly different from men’s mean official time (229 minutes) (p<0.001).
Women’s mean half time (117 minutes) is significantly different from men’s mean half time (105 minutes) (p<0.001).
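The report does not show which test produced these p-values. A minimal sketch, assuming a Welch two-sample t-test and using synthetic data whose means loosely echo the figures above (the data frame and its values are invented for illustration):

```r
set.seed(1)
# Synthetic stand-in for the real data; column names follow the data dictionary.
runners <- data.frame(
  M.F = rep(c("F", "M"), each = 500),
  Official.Time.Min = c(rnorm(500, mean = 249, sd = 40),
                        rnorm(500, mean = 229, sd = 40))
)

# Welch two-sample t-test of finishing time by gender (unequal variances).
res <- t.test(Official.Time.Min ~ M.F, data = runners)
res$estimate  # group means
res$p.value
```

The same call with `Age` or the half time on the left-hand side would cover the other two comparisons.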
SMART Questions
Do average finishing times differ across age groups?
Are there any trends across age groups?
Is age really a factor in finishing time for a marathon (or at least for this Boston Marathon)?
Let us add a column to the existing data frame that records the age group each runner belongs to. We divided the age groups into 5-year bands, following the USATF (USA Track & Field) standard, while also keeping in mind the number of runners in each group. In the 2017 Boston Marathon, the youngest participant is 18 and the oldest is 84. Marathons and other long-distance events often use 19 and under as the youngest age group. With this in mind, we divided the age groups as 18-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, and 60+.
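One way to build this column is base R's `cut()`. The sketch below assumes the age column is named `Age`, as in the data dictionary; the helper name `add_agegroup` and the demo data are illustrative.

```r
# Upper bounds of each band; intervals are closed on the right, so (17, 24] is 18-24.
age_breaks <- c(17, 24, 29, 34, 39, 44, 49, 54, 59, Inf)
age_labels <- c("18-24", "25-29", "30-34", "35-39", "40-44",
                "45-49", "50-54", "55-59", "60+")

add_agegroup <- function(df) {
  df$agegroup <- cut(df$Age, breaks = age_breaks, labels = age_labels)
  df
}

demo <- add_agegroup(data.frame(Age = c(18, 24, 25, 59, 60, 84)))
table(demo$agegroup)
```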
Let’s visualize the data. First, we look at the total number of runners in each age group.
Because we also took the number of runners into account when defining the groups, each age group has more than 1000 runners. Now, let’s look at a boxplot of official run time by age group to see the trends across age groups.
From the boxplot, we can see that the typical running time differs little across the age groups from 18-24 through 45-49, and that running times start to rise from age 50. The distributions of several age groups also look alike apart from a few outliers. The boxplot suggests that runners in the 30-34 age group are the fastest and runners in the 60+ age group are the slowest. Now, let’s subset the data by age group before we go on to statistical tests.
Below are the QQ plots for each of these subsets, used to check whether they are normally distributed.
Let’s sample 50 observations from each age-group subset and bind them together into a new data frame.
Let’s perform an ANOVA test to compare the mean run times across age groups. Null hypothesis: there is no significant difference between the mean run times of the age groups; in other words, age has no significant impact on runners’ finishing times.
## Df Sum Sq Mean Sq F value Pr(>F)
## agegroup 8 70939 8867 5.445 1.46e-06 ***
## Residuals 441 718133 1628
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] 1.959398
According to the results of the ANOVA test, the p-value is 1.46e-06, which is less than the significance level of 0.05. We can formally reject the null hypothesis that there is no difference between the means. Let’s take a look at the Tukey comparison table to compare the means of individual age groups.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Official.Time.Min ~ agegroup, data = sample_ag)
##
## $agegroup
## diff lwr upr p adj
## 25-29-18-24 -10.769000 -35.9276796 14.389680 0.9205640
## 30-34-18-24 -17.109333 -42.2680129 8.049346 0.4609946
## 35-39-18-24 -4.091000 -29.2496796 21.067680 0.9998883
## 40-44-18-24 -13.894667 -39.0533463 11.264013 0.7329886
## 45-49-18-24 -13.312667 -38.4713463 11.846013 0.7766550
## 50-54-18-24 -2.128000 -27.2866796 23.030680 0.9999993
## 55-59-18-24 11.556000 -13.6026796 36.714680 0.8850655
## 60+-18-24 23.675333 -1.4833463 48.834013 0.0836800
## 30-34-25-29 -6.340333 -31.4990129 18.818346 0.9972185
## 35-39-25-29 6.678000 -18.4806796 31.836680 0.9960066
## 40-44-25-29 -3.125667 -28.2843463 22.033013 0.9999858
## 45-49-25-29 -2.543667 -27.7023463 22.615013 0.9999972
## 50-54-25-29 8.641000 -16.5176796 33.799680 0.9780804
## 55-59-25-29 22.325000 -2.8336796 47.483680 0.1285818
## 60+-25-29 34.444333 9.2856537 59.603013 0.0008069
## 35-39-30-34 13.018333 -12.1403463 38.177013 0.7974242
## 40-44-30-34 3.214667 -21.9440129 28.373346 0.9999824
## 45-49-30-34 3.796667 -21.3620129 28.955346 0.9999367
## 50-54-30-34 14.981333 -10.1773463 40.140013 0.6442827
## 55-59-30-34 28.665333 3.5066537 53.824013 0.0125198
## 60+-30-34 40.784667 15.6259871 65.943346 0.0000224
## 40-44-35-39 -9.803667 -34.9623463 15.355013 0.9530719
## 45-49-35-39 -9.221667 -34.3803463 15.937013 0.9673233
## 50-54-35-39 1.963000 -23.1956796 27.121680 0.9999996
## 55-59-35-39 15.647000 -9.5116796 40.805680 0.5871250
## 60+-35-39 27.766333 2.6076537 52.925013 0.0182424
## 45-49-40-44 0.582000 -24.5766796 25.740680 1.0000000
## 50-54-40-44 11.766667 -13.3920129 36.925346 0.8741492
## 55-59-40-44 25.450667 0.2919871 50.609346 0.0449487
## 60+-40-44 37.570000 12.4113204 62.728680 0.0001479
## 50-54-45-49 11.184667 -13.9740129 36.343346 0.9028519
## 55-59-45-49 24.868667 -0.2900129 50.027346 0.0554874
## 60+-45-49 36.988000 11.8293204 62.146680 0.0002051
## 55-59-50-54 13.684000 -11.4746796 38.842680 0.7491657
## 60+-50-54 25.803333 0.6446537 50.962013 0.0394364
## 60+-55-59 12.119333 -13.0393463 37.278013 0.8545510
Looking at the Tukey table, every pairwise comparison with an adjusted p-value below the 0.05 significance level involves the 55-59 or 60+ age group; all comparisons among the younger groups have high p-values. So, let’s omit the 55-59 and 60+ age groups and look at the trend among the remaining groups by performing ANOVA again.
Null hypothesis: there is no significant difference between the mean run times of the remaining age groups; in other words, age has no significant impact on runners’ finishing times.
## Df Sum Sq Mean Sq F value Pr(>F)
## agegroup 6 13168 2195 1.308 0.253
## Residuals 343 575685 1678
## [1] 2.125037
After performing the test, we get a p-value of 0.253, so we fail to reject the null hypothesis that there is no difference between the means.
In conclusion, for the Boston Marathon, we find no evidence that age is a factor in finishing time for runners below 55.
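The workflow in this section (sample 50 runners per age group, run a one-way ANOVA, then Tukey's comparisons) can be sketched as follows. The data here are synthetic and the object names other than `sample_ag`, `agegroup` and `Official.Time.Min` are assumptions. The bare `[1]` value printed under each ANOVA table is consistent with an F critical value from `qf()`, though that reading is also an assumption.

```r
set.seed(42)
groups <- c("18-24", "25-29", "30-34", "35-39", "40-44",
            "45-49", "50-54", "55-59", "60+")

# Synthetic population: 200 runners per age group.
runners <- data.frame(
  agegroup = factor(rep(groups, each = 200), levels = groups),
  Official.Time.Min = rnorm(9 * 200, mean = 235, sd = 40)
)

# Draw 50 runners from each age group and bind them into one data frame.
sample_ag <- do.call(rbind, lapply(split(runners, runners$agegroup),
                                   function(g) g[sample(nrow(g), 50), ]))

fit <- aov(Official.Time.Min ~ agegroup, data = sample_ag)
summary(fit)

# Critical F value at the 5% level for 8 and 441 degrees of freedom.
f_crit <- qf(0.95, df1 = 8, df2 = fit$df.residual)

# Pairwise comparisons with family-wise 95% confidence.
tukey <- TukeyHSD(fit)
```

With nine groups of 50, the residual degrees of freedom are 450 - 9 = 441 and there are 36 pairwise comparisons, which matches the shape of the tables above.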
SMART Questions
Which age group performs best over the years?
Are there any trends in the average official time for the four countries with the most runners over the three years?
Let’s check whether the mean official time is the same for the years 2015, 2016 and 2017.
Let’s check whether third-time runners perform better than first-time runners by comparing their average official times in 2017.
From the above three plots, we can observe that the 30-34 age group has the lowest average official time in each year. So, we can conclude that runners aged 30-34 perform better than the other age groups in all three years.
## 56 codes from your data successfully matched countries in the map
## 35 codes from your data failed to match with a country code in the map
## 187 codes from the map weren't represented in your data
## 53 codes from your data successfully matched countries in the map
## 26 codes from your data failed to match with a country code in the map
## 190 codes from the map weren't represented in your data
## You asked for 7 categories, 5 were used due to pretty() classification
## 51 codes from your data successfully matched countries in the map
## 28 codes from your data failed to match with a country code in the map
## 192 codes from the map weren't represented in your data
## You asked for 7 categories, 5 were used due to pretty() classification
As we can see from the plot, there is a very slight increase in average official time over the three years for the four countries with the most runners.
Null hypothesis: There is no difference in means of the official time of runners for three years.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Official.Time.Min ~ year, data = official_time_sample)
##
## $year
## diff lwr upr p adj
## 2016-2015 3.6366667 -15.18589 22.45922 0.8911584
## 2017-2015 3.2500000 -15.57256 22.07256 0.9120556
## 2017-2016 -0.3866667 -19.20922 18.43589 0.9986966
The p-value of 0.8814855 is greater than 0.05. Tukey’s results also show adjusted p-values greater than 0.05 for all comparisons. So, we fail to reject the null hypothesis: we find no evidence that the mean official time differs across the three years.
A total of 5458 runners participated in both the 2015 and 2016 marathons, and 2268 runners participated in all three years.
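The report does not say how runners were matched across years. A minimal sketch, assuming a match on the `Name` field alone (which can over-count, since some names repeat), with tiny invented data frames standing in for the three yearly files:

```r
# Hypothetical per-year data; matching on Name alone is an assumption.
r2015 <- data.frame(Name = c("A Runner", "B Runner", "C Runner"),
                    stringsAsFactors = FALSE)
r2016 <- data.frame(Name = c("B Runner", "C Runner", "D Runner"),
                    stringsAsFactors = FALSE)
r2017 <- data.frame(Name = c("C Runner", "D Runner", "E Runner"),
                    stringsAsFactors = FALSE)

# Runners present in both 2015 and 2016.
both_15_16 <- intersect(r2015$Name, r2016$Name)

# Runners present in all three years.
all_three <- Reduce(intersect, list(r2015$Name, r2016$Name, r2017$Name))

length(both_15_16)
length(all_three)
```

Adding a second key such as city or age adjusted for the year would reduce false matches between different runners who share a name.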
Null hypothesis for the z-test: there is no difference in the average official times of first-time and third-time runners.
Alternative hypothesis: the average official time of first-time runners is greater than the average official time of runners who participated in all three years. Alpha: 0.05.
Since the p-value is less than the alpha of 0.05, we reject the null hypothesis. That means the average official time of third-time runners is lower than that of first-time runners, and we can conclude that third-time runners perform better than first-time runners.
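For reference, a one-sided two-sample z-test can be written directly in base R. This is a sketch on synthetic data, not the code used in the report; `z_test_greater` is a hypothetical helper, and using sample variances in place of known population variances is an assumption that is reasonable only for large samples.

```r
# One-sided two-sample z-test: H1 is mean(x) > mean(y).
z_test_greater <- function(x, y) {
  se <- sqrt(var(x) / length(x) + var(y) / length(y))
  z <- (mean(x) - mean(y)) / se
  c(z = z, p.value = pnorm(z, lower.tail = FALSE))
}

set.seed(7)
first_time <- rnorm(400, mean = 240, sd = 40)  # synthetic first-time runners
third_time <- rnorm(400, mean = 225, sd = 35)  # synthetic third-time runners

res <- z_test_greater(first_time, third_time)
res
```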
### Smart Questions
I focus on exploring whether nationality affects marathon time. Specific questions include:
(1) Is marathon time the same across nationalities/continents? If not, how does it differ?
(2) Among the ‘fast runners’ group (the top quarter), is marathon time still the same or different? What does the difference look like?
(3) Are nationality/continent and finish time independent?
### Expectation and Data Exploration
Before exploring the data, we expect the African group to be the fastest and the average finish times to differ between groups. As for the nationality/continent factor, we expect it not to be independent of finish time.
First we take a look at the distribution of finish time across countries.
From the finish time by nationality plot, we can see that finish time differs among country groups. But since more than 90 countries are included, the plot is hard to read and explore further. So we decided to group countries by continent. Since American runners are by far the largest group, we list the USA as its own category in Continent.
We use the countrycode library, which helps us convert country codes to continents. After checking, we found that some country codes are not covered by countrycode, so we reviewed the remaining country codes and fixed them by hand. We now have a new column called Continent, with six groups in total: Asia, Americas, Europe, USA, Africa and Oceania.
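The mapping step can be sketched without any packages. The report uses countrycode for this; the version below swaps in a small hand-written lookup so that it has no dependencies. The codes listed and the helper `to_continent` are illustrative only, and a real table would need to cover every code in the data.

```r
# Illustrative lookup; USA is kept as its own category, as described above.
continent_lookup <- c(USA = "USA", CAN = "Americas", BRA = "Americas",
                      GBR = "Europe", JPN = "Asia", AUS = "Oceania",
                      KEN = "Africa")

to_continent <- function(country) {
  # Codes missing from the lookup come back as NA.
  unname(continent_lookup[as.character(country)])
}

codes <- c("USA", "KEN", "GBR", "XYZ")
continent <- to_continent(codes)

# Codes that still need to be reviewed and fixed by hand.
unmatched <- unique(codes[is.na(continent)])
```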
By continent, 20945 runners are from the U.S., 2944 are from the Americas, 1614 are from Europe, 681 are from Asia, 194 are from Oceania, and 32 are from Africa.
Then we explore the finish time by continent plot. The boxplot shows that the average finish time of the Africa group is much lower than that of the others. All groups except Africa have outliers. We expect the Africa group’s average time to be lower than and different from the others, so we run an ANOVA test to check whether our expectation is correct.
Before the ANOVA test, we first explore whether each group is normally distributed. All groups except Africa are approximately normal. Because the Africa group does not deviate too much, we continue with ANOVA.
The ANOVA test shows a very low p-value, so we reject the null hypothesis that the mean official finish times are the same. To find which groups differ, we run a post-hoc test.
At the 95% confidence level, we fail to reject the null hypothesis of equal finishing times for two pairs of groups: Oceania and Americas, and Oceania and Europe. For every other pair, we reject the null hypothesis that their finishing times are the same. Therefore, we find no evidence of a difference between runners from Oceania and the Americas or between Oceania and Europe, but the other continent groups differ.
Then we remove outliers to see whether anything changes.
We remove individuals who finished the marathon extremely slowly.
After removing outliers, we test for normality and run the ANOVA test again.
The result shows that after removing outliers the p-value is still extremely low, so we reject the null hypothesis that finish times are the same. The post-hoc test shows the same results as before: apart from Oceania and Americas, and Oceania and Europe, the groups differ in finish time.
## Outliers identified: 169
## Proportion (%) of outliers: 0.7
## Mean of the outliers: 335.47
## Mean without removing outliers: 235.58
## Mean if we remove outliers: 234.93
## Outliers successfully removed
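The rule used to flag outliers is not shown in the report. A minimal sketch, assuming the standard boxplot rule (points more than 1.5 times the hinge spread beyond the hinges), which produces the same kind of summary; `summarise_outliers` and the sample vector are made up for illustration:

```r
# Flag outliers with the boxplot rule and report before/after means.
summarise_outliers <- function(x) {
  out <- boxplot.stats(x)$out
  kept <- x[!x %in% out]
  list(n_outliers = length(out),
       pct_outliers = round(100 * length(out) / length(x), 1),
       mean_outliers = mean(out),
       mean_all = mean(x),
       mean_kept = mean(kept),
       kept = kept)
}

times <- c(200, 210, 220, 230, 240, 250, 400)  # one very slow finisher
s <- summarise_outliers(times)
s$n_outliers
s$mean_kept
```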
Next we explore whether the results are the same among the fast runners. We subset the runners in the fastest quarter by finish time. Runners who finished the race in less than 207 minutes are in the top quarter.
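Selecting the fastest quarter amounts to cutting at the first quartile of finish time. A short sketch on an invented vector of times (the names `cutoff` and `fast` are illustrative):

```r
times <- c(150, 190, 205, 220, 231, 240, 260, 300)  # finish times in minutes

# First quartile of finish time; runners below it form the fast group.
cutoff <- quantile(times, 0.25)
fast <- times[times < cutoff]

cutoff
fast
```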
The boxplot shows that African runners still run faster than the other five groups.
We test for normality as before. Now all groups follow a normal distribution. The ANOVA test leads us to reject the null hypothesis that all groups have the same mean. The post-hoc test shows that the African group has a different finish time from every other group, while the remaining groups show no difference in finish time.
Last but not least, we explore whether continent and finish time are independent in the fast group. We ran a chi-squared test, and the result shows a small p-value, so we reject the null hypothesis that the two are independent. Therefore, continent is associated with finish time.
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: conttable
## X-squared = 15051, df = NA, p-value = 0.0004998
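The output above has the form produced by `chisq.test()` with `simulate.p.value = TRUE`. With 2000 replicates, the smallest p-value the simulation can return is 1/2001, which rounds to the 0.0004998 reported, so the value is best read as "at most 0.0005". The sketch below reproduces that behaviour on an invented table; how finish time was binned for `conttable` is not shown in the report, so the two time bins here are an assumption.

```r
set.seed(123)
# Invented contingency table of continent by finish-time bin.
conttable <- matrix(c(30,  5,
                      10, 25,
                       8, 22),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(Continent = c("Africa", "Europe", "USA"),
                                    Time.Bin = c("faster half", "slower half")))

# Monte Carlo p-value based on 2000 simulated tables; df is reported as NA.
res <- chisq.test(conttable, simulate.p.value = TRUE, B = 2000)
res$p.value

# Floor on the simulated p-value: (1 + 0) / (B + 1).
min_p <- 1 / (2000 + 1)
```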
I decided to focus on how much the state you live in affects the average marathon time. Several specific questions follow from that broader question:
How do runners from Massachusetts, the home state of the marathon, compare to the rest of the runners?
Are the top three states’ average finishing times similar?
Are there states that are particularly faster or slower than others?
Are there any trends across different regions?
We’ll explore elements of all of these questions in the data analysis below.
First, let’s read in the data for 2017. I’m using the standard R ‘read.csv’ rather than the readr ‘read_csv’, since that package converts all times to a specific time of day.
We have a dataframe with 26,410 total entries and 25 fields. The data dictionary for all of these fields is included in the data summary and introduction.
Let’s make a few changes to the “Official.Time” field, which represents the finishing time for each runner. We’ll create a new field called “Official.Time.Min” that holds the time in minutes. We’ll use the lubridate package to convert a character time in ‘Hours:Minutes:Seconds’ format to a number of seconds, which we then divide by 60 to get a total run time in minutes. This will let us compare running times more easily.
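The conversion itself is simple arithmetic. The report uses lubridate for it; the sketch below is a dependency-free base R equivalent, and `hms_to_minutes` is a hypothetical helper name. It assumes every value has exactly three colon-separated parts.

```r
# Convert "H:MM:SS" strings to total minutes.
hms_to_minutes <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  # hours * 60 + minutes + seconds / 60
  vapply(parts, function(p) sum(as.numeric(p) * c(60, 1, 1 / 60)), numeric(1))
}

marathon <- data.frame(Official.Time = c("3:49:00", "2:09:37"),
                       stringsAsFactors = FALSE)
marathon$Official.Time.Min <- hms_to_minutes(marathon$Official.Time)
marathon
```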
For this section of analysis, we’ll first explore the data and see how consistent the average race times are across the states. We’ll also look at how many runners come from each state and ensure there are enough runners from each state to be significant. Then, I’ll look at the states with the three most runners, and run an ANOVA comparison on the means of their times. To do this, I only need the fields ‘Official.Time.Min’ and ‘State’, so I’ll subset down to save some memory since there are over 26,000 observations. At first, I will include the ‘Country’ field so I can subset down to just the United States.
From running the summary function on our data, we learn some interesting information about where our runners come from. Nearly 21,000 of our 26,000 runners come from the United States, with the second-best represented country being Canada. Looking at the states, Massachusetts has over 4500 runners, more than double the next state. There are also roughly 3,600 blank “states”, which makes sense for runners from countries with administrative regions other than states.
When we look at the descriptive statistics from the official running time in minutes, we see evidence of a definite right skew. The median is 231.66, but the mean is 238.06, suggesting a significant right tail. The standard deviation is 42.15.
Now let’s examine just runners from the United States, the focus for this analysis. We have 20,945 runners from the United States. Unsurprisingly, we are left with very similar summary statistics because US runners make up such a large portion of the field. The median is 233.12, but the mean is 239.7, again suggesting a significant right tail. The standard deviation is 42.65.
We are left with 57 total states in the dataframe. In addition to the 50 US states, there are also runners from:
AA, AE, and AP: These are military postal codes for the Armed Forces Americas, Armed Forces Europe, and Armed Forces Pacific, and are usually used by members of the US military stationed overseas.
DC: District of Columbia
GU: Guam
PR: Puerto Rico
VI: US Virgin Islands
Now let’s look at some basic visualizations. We’ll first look at the total number of runners for each state and the average race time for each state. Since I will be closely examining two variables (count of runners per state and average finishing time per state) across the same data, I will always use yellows/golds when color represents counts of runners and blues when color represents average finishing times. As these are the colors of the Boston Marathon, I thought them appropriate for the analysis.
So we can see that most states have fewer than 500 participants, and only a few have more than 1000. Most states seem to have an average race time between 220 and 240 minutes, with a few outliers. But we don’t know how many runners were in those outlier states. It could be that there were just a couple of runners who ran very fast or slow. We’ll have to look at these possibilities later on in our analysis.
Let’s merge our total runner count and average time into one data frame, and then merge this data into our ‘US’ dataframe with all of the observations. This will make it much easier to plot different visualizations by letting us plot only states where the count of runners is above a certain threshold.
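A base R sketch of this merge is shown below. The column names `Total.Runners` and `Avg.Time`, the tiny data frame, and the threshold of 2 are assumptions chosen to keep the example small.

```r
us <- data.frame(State = c("MA", "MA", "MA", "CA", "CA", "CO"),
                 Official.Time.Min = c(250, 260, 270, 230, 240, 210),
                 stringsAsFactors = FALSE)

# Per-state runner count and mean finishing time.
counts <- aggregate(Official.Time.Min ~ State, data = us, FUN = length)
names(counts)[2] <- "Total.Runners"
means <- aggregate(Official.Time.Min ~ State, data = us, FUN = mean)
names(means)[2] <- "Avg.Time"

# One row per state, then attach the state-level values to every runner.
state_stats <- merge(counts, means, by = "State")
us <- merge(us, state_stats, by = "State")

# Keep only states above a runner-count threshold before plotting.
min_runners <- 2
plot_data <- us[us$Total.Runners > min_runners, ]
```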
We can now make a boxplot for all of the states with more than a certain number of runners in the 2017 race. After trying several thresholds, I decided 60 runners was a good threshold for analysis. It included most states but removed several states with a small number of runners and a significantly different average race time.
I’ve also shaded the boxplots by the number of runners, so that boxplots in darker gold have more participants than those in lighter yellow. MA definitely has the most participants, and a much slower time. Additionally, the IQRs for some other states near MA, like RI and NH, sit at definitely slower times than those of other states. Also, runners from Colorado seem to be FAST, but it looks like there aren’t that many participants.
Now that we have this data prepared, we can make a bar chart to better review it. We’ll plot the height of each bar as a function of count, and color by the average time of the runners. We’ll also make two copies of this chart. The first will be sorted by total number of runners per state, and the second by average completion time. We’ll also only plot states that had more than 60 runners to prevent one or two “fast” runners from a state being over-weighted.
Now we are seeing some trends for different states. Four of the slowest ten states (MA, RI, NH, ME) are in New England. It looks like runners from New England, and MA specifically, are much slower. Also, CO looks pretty fast, and it’s easy to see here that it had over 500 runners.
Before we go on to some statistical tests to see if we can prove that New England is slower, let’s make a map to get some more info.
Let’s make a plot of average finishing time across all of the states. Since a few finishing times in states with small numbers of runners can dramatically skew the average, let’s keep looking only at states with more than 60 runners. States with fewer than 60 runners will be greyed out.
A plot of our other variable of interest, total number of runners by state, was not very interesting because MA has so many more runners than any other state. Even the #2 state, California, has almost double the runners of the #3 state, New York. Therefore, I kept the plotting focused on average finishing time and excluded the map of total runner counts.
I also diverged from my yellow/blue color scheme and chose the ‘viridis’ color palette to better show the contrasts between states.
Looks great! We can see that New England does look a little slower, so let’s make another map with just New England.
For states in New England, we definitely see slower times. In fact, the mean finishing time for New Englanders is about 262.56 minutes, well above the overall US mean of 239.7 minutes.
So the top states are MA, CA, and NY. Let’s first focus on these three states and see whether their average finishing times differ. Let’s start by making a boxplot for each of these three states.
The boxplot really shows that MA runners are slower than those from CA and NY. The histogram for MA is also much less skewed to the right than CA or NY. This suggests there’s something different about runners from MA, aside from there being so many more runners from MA.
We can create QQ plots for all three of these datasets to see if they are normal. After reviewing all three, the MA dataset is the most normal of the three, with CA and NY having significant right-skews. It makes sense to sample all three of these datasets for our ANOVA testing. Let’s sample 50 observations and bind them together to a new dataframe.
Now let’s look at the ANOVA test comparing the means of the top three states. The null hypothesis is that there is no difference in means between the three states. The alternative hypothesis is that there is a difference. If the p-value is sufficiently small (below 0.05, the significance level that corresponds to 95% confidence), we will reject the null hypothesis and conclude there is a difference between at least some of the means.
After completing the test, we find a very small p-value of 1.7505299 × 10^-6, so we reject the null hypothesis that there is no difference between the means. There is something different about these means. Let’s take a look at Tukey’s multiple comparison of means.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Official.Time.Min ~ State, data = samples)
##
## $State
## diff lwr upr p adj
## MA-CA 32.68267 13.80031 51.56502 0.0002017
## NY-CA -7.83900 -26.72136 11.04336 0.5887870
## NY-MA -40.52167 -59.40402 -21.63931 0.0000033
When we look at NY to CA, Tukey’s family-wise comparison shows a high adjusted p-value (0.589). Let’s explore that more.
Let’s conduct a two-sample t-test comparing these two similar means for CA and NY. The null hypothesis for this test will be that CA and NY have the same mean finishing time. The alternative hypothesis is that NY and CA have different average finishing times.
The t-test returns a p-value of 0.25002, which is well above 0.05. Thus, we fail to reject the null hypothesis that there is no difference in means between CA and NY.
If we filter out states in New England, how different are the average finishing times for the rest of the country? To answer this question, we first have to do some data cleaning. We can limit our analysis to the 35 non-New England states that had more than 60 runners. Then, we’ll take a random sample of 50 runners from each of those states, and finally conduct an ANOVA on that sample data grouped by state. The null hypothesis is that there is no difference among the means of non-New England states with more than 60 runners. The alternative is that there is a difference.
After running the analysis, we get a high p-value of 0.16196, meaning we fail to reject the null hypothesis. It looks like there is a high level of consistency in average run time across the US outside of New England. When I ran this analysis before on states with fewer total runners at the race, my answers were far more variable. This is further evidence that we need to filter out states with only a handful of runners, as a few fast or slow finishing times can affect the entire state’s average.
The final piece of analysis we’ll do is a two-sample T-test between New England states and non-New England states to compare average finishing times. The null hypothesis is that there is no difference in mean finishing times for states in New England compared to the rest of the US, while the alternative hypothesis is there will be a difference. Based on our previous analysis, we expect there to be a difference, but we should run the test to be sure.
The test returns a very low p-value of 7.7607268 × 10^-6, telling us we should reject the null hypothesis. Average race times from New England are different from averages around the rest of the country.
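The New England comparison comes down to flagging each runner's region and running a two-sample t-test on that flag. A sketch on synthetic data follows; the `Region` column, the mix of states, and the simulated means are assumptions, and the six-state definition of New England is the standard one.

```r
new_england <- c("CT", "ME", "MA", "NH", "RI", "VT")

set.seed(2017)
# Synthetic runners drawn from ten states.
us <- data.frame(State = sample(c(new_england, "CA", "NY", "TX", "CO"),
                                600, replace = TRUE),
                 stringsAsFactors = FALSE)
us$Region <- ifelse(us$State %in% new_england, "New England", "Other")

# Simulated finishing times, slower on average for New England.
us$Official.Time.Min <- rnorm(600,
                              mean = ifelse(us$Region == "New England", 255, 235),
                              sd = 40)

# Welch two-sample t-test of finishing time by region.
res <- t.test(Official.Time.Min ~ Region, data = us)
res$p.value
```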